My personal course notes from the Analyzing Survey Data in R course on DataCamp, taught by Prof. Kelly McConville.
You’ve taken a survey (or 1000) before, right? Have you ever wondered what goes into designing a survey and how survey responses are turned into actionable insights? Of course you have! In Analyzing Survey Data in R, you will work with surveys from A to Z, starting with common survey design structures, such as clustering and stratification, and will continue through to visualizing and analyzing survey results. You will model survey data from the National Health and Nutrition Examination Survey using R’s survey and tidyverse packages. Following the course, you will be able to successfully interpret survey results and finally find the answers to life’s burning questions!
library(tidyverse)
library(survey)
library(NHANES)
library(knitr)
ce <- read_csv("data/ce.csv")
data(api, package = "survey") # loads apisrs, apistrat, apiclus2, and related api datasets
Our exploration of survey data will begin with survey weights. In this chapter, we will learn what survey weights are and why they are so important in survey data analysis. Another unique feature of survey data is how they are collected, often via clustering and stratification. We’ll practice specifying and exploring these sampling features for several survey datasets.
# Tidier alternative
ce %>%
  summarize(y_bar = mean(FINCBTAX))
## # A tibble: 1 × 1
## y_bar
## <dbl>
## 1 62480.
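Note that the mean above ignores the survey weights. For comparison, here is a minimal sketch of a weighted mean, using base R’s weighted.mean() with the FINLWT21 weights column (introduced just below); this gives the point estimate only, with no design-based standard error:
# Survey-weighted mean of pre-tax income, weighting each household by FINLWT21
ce %>%
  summarize(y_bar_wtd = weighted.mean(FINCBTAX, w = FINLWT21))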
Let’s look at the data from the Consumer Expenditure Survey and familiarize ourselves with the survey weights. Use the glimpse() function in the dplyr package to look at the ce dataset and check out the weights column, FINLWT21. ce and dplyr are pre-loaded. Interpret the meaning of the third observation’s survey weight.
# glimpse(ce): instead of displaying every column, only the queried weights column is shown below
glimpse(ce$FINLWT21)
## num [1:6301] 25985 6581 20208 18078 20112 ...
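The third weight is about 20,208: that household represents roughly 20,208 households in the US population. To pull it out directly, a quick sketch:
# Index the weights vector to get the third observation's weight
ce$FINLWT21[3]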
Great! Now that we know what the weights mean, let’s dig a little deeper!
Graphics are a much better way of visualizing data than just staring at the raw data frame! Therefore, in this activity, we will use the data visualization package ggplot2 to create a histogram of the survey weights.
# ggplot2 is already loaded
# Construct a histogram of the weights
ggplot(ce, aes(FINLWT21)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Beautiful binning of those weights in your histogram! What shape would you say the weights follow? What does that imply about a few of the households from this survey?
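If the weights look heavily right-skewed, one option (a sketch) is to redraw the histogram on a log10 x-axis so the long right tail is easier to inspect:
# Re-plot the weights on a log10 scale to spread out the right tail
ggplot(ce, aes(FINLWT21)) +
  geom_histogram(bins = 30) +
  scale_x_log10()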
In the next few exercises we will practice specifying sampling designs using different samples from the api dataset, located in the survey package. The api dataset contains the Academic Performance Index and demographic information for schools in California. The apisrs dataset is a simple random sample of schools from the api dataset. Let’s specify its design!
# Look at the apisrs dataset
glimpse(apisrs)
## Rows: 200
## Columns: 39
## $ cds <chr> "15739081534155", "19642126066716", "30664493030640", "196445…
## $ stype <fct> H, E, H, E, E, E, M, E, E, E, E, H, M, E, E, E, M, M, H, E, E…
## $ name <chr> "McFarland High", "Stowers (Cecil ", "Brea-Olinda Hig", "Alam…
## $ sname <chr> "McFarland High", "Stowers (Cecil B.) Elementary", "Brea-Olin…
## $ snum <dbl> 1039, 1124, 2868, 1273, 4926, 2463, 2031, 1736, 2142, 4754, 1…
## $ dname <chr> "McFarland Unified", "ABC Unified", "Brea-Olinda Unified", "D…
## $ dnum <int> 432, 1, 79, 187, 640, 284, 401, 401, 470, 632, 401, 753, 784,…
## $ cname <chr> "Kern", "Los Angeles", "Orange", "Los Angeles", "San Luis Obi…
## $ cnum <int> 14, 18, 29, 18, 39, 18, 18, 18, 18, 37, 18, 24, 14, 1, 47, 18…
## $ flag <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcttest <int> 98, 100, 98, 99, 99, 93, 98, 99, 100, 90, 95, 100, 97, 99, 98…
## $ api00 <int> 462, 878, 734, 772, 739, 835, 456, 506, 543, 649, 556, 671, 5…
## $ api99 <int> 448, 831, 742, 657, 719, 822, 472, 474, 458, 604, 575, 620, 5…
## $ target <int> 18, NA, 3, 7, 4, NA, 16, 16, 17, 10, 11, 9, 14, 5, 15, 10, 6,…
## $ growth <int> 14, 47, -8, 115, 20, 13, -16, 32, 85, 45, -19, 51, 4, 51, 50,…
## $ sch.wide <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, No, Yes, No, Y…
## $ comp.imp <fct> Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, No, Y…
## $ both <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, No, Ye…
## $ awards <fct> No, Yes, No, Yes, Yes, No, No, Yes, Yes, No, No, Yes, No, Yes…
## $ meals <int> 44, 8, 10, 70, 43, 16, 81, 98, 94, 85, 81, 67, 77, 20, 70, 75…
## $ ell <int> 31, 25, 10, 25, 12, 19, 40, 65, 65, 57, 4, 25, 32, 16, 23, 18…
## $ yr.rnd <fct> NA, NA, NA, NA, NA, NA, NA, No, NA, NA, NA, NA, NA, NA, No, N…
## $ mobility <int> 6, 15, 7, 23, 12, 13, 22, 43, 15, 10, 20, 12, 4, 32, 17, 9, 1…
## $ acs.k3 <int> NA, 19, NA, 23, 20, 19, NA, 18, 19, 16, 16, NA, NA, 19, 21, 2…
## $ acs.46 <int> NA, 30, NA, NA, 29, 29, 30, 29, 32, 25, 27, NA, NA, 29, 30, 2…
## $ acs.core <int> 24, NA, 28, NA, NA, NA, 27, NA, NA, 30, NA, 17, 27, NA, NA, N…
## $ pct.resp <int> 82, 97, 95, 100, 91, 71, 49, 75, 99, 49, 62, 96, 77, 96, 39, …
## $ not.hsg <int> 44, 4, 5, 37, 8, 1, 30, 49, 48, 23, 5, 44, 40, 4, 14, 18, 15,…
## $ hsg <int> 34, 10, 9, 40, 21, 8, 27, 31, 34, 36, 38, 19, 34, 14, 57, 28,…
## $ some.col <int> 12, 23, 21, 14, 27, 20, 18, 15, 14, 14, 29, 17, 16, 25, 18, 2…
## $ col.grad <int> 7, 43, 41, 8, 34, 38, 22, 2, 4, 21, 24, 19, 8, 37, 10, 23, 28…
## $ grad.sch <int> 3, 21, 24, 1, 10, 34, 2, 3, 1, 6, 5, 2, 2, 19, 1, 3, 10, 32, …
## $ avg.ed <dbl> 1.91, 3.66, 3.71, 1.96, 3.17, 3.96, 2.39, 1.79, 1.77, 2.51, 2…
## $ full <int> 71, 90, 83, 85, 100, 75, 72, 69, 68, 81, 84, 100, 89, 95, 96,…
## $ emer <int> 35, 10, 18, 18, 0, 20, 25, 22, 29, 7, 16, 0, 11, 5, 6, 10, 8,…
## $ enroll <int> 477, 478, 1410, 342, 217, 258, 1274, 566, 645, 311, 328, 210,…
## $ api.stu <int> 429, 420, 1287, 291, 189, 211, 1090, 353, 563, 258, 253, 166,…
## $ pw <dbl> 30.97, 30.97, 30.97, 30.97, 30.97, 30.97, 30.97, 30.97, 30.97…
## $ fpc <dbl> 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6…
# Specify a simple random sampling design for apisrs
apisrs_design <- svydesign(data = apisrs,
                           weights = ~pw,
                           fpc = ~fpc,
                           id = ~1)
# Produce a summary of the design
summary(apisrs_design)
## Independent Sampling design
## svydesign(data = apisrs, weights = ~pw, fpc = ~fpc, id = ~1)
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03229 0.03229 0.03229 0.03229 0.03229 0.03229
## Population size (PSUs): 6194
## Data variables:
## [1] "cds" "stype" "name" "sname" "snum" "dname"
## [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
## [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
## [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
## [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
## [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
## [37] "api.stu" "pw" "fpc"
# or: apisrs_design %>% summary()
Now let’s practice specifying a stratified sampling design, using the dataset apistrat. The schools are stratified based on the school type stype, where E = Elementary, M = Middle, and H = High School. For each school type, a simple random sample of schools was taken.
# Glimpse the data
glimpse(apistrat)
## Rows: 200
## Columns: 39
## $ cds <chr> "19647336097927", "19647336016018", "19648816021505", "196473…
## $ stype <fct> E, E, E, E, E, E, E, E, E, E, M, M, H, M, H, E, E, M, M, E, E…
## $ name <chr> "Open Magnet: Ce", "Belvedere Eleme", "Altadena Elemen", "Sot…
## $ sname <chr> "Open Magnet: Center for Individual (Char", "Belvedere Elemen…
## $ snum <dbl> 2077, 1622, 2236, 1921, 6140, 6077, 6071, 904, 4637, 4311, 41…
## $ dname <chr> "Los Angeles Unified", "Los Angeles Unified", "Pasadena Unifi…
## $ dnum <int> 401, 401, 541, 401, 460, 689, 689, 41, 702, 135, 590, 767, 25…
## $ cname <chr> "Los Angeles", "Los Angeles", "Los Angeles", "Los Angeles", "…
## $ cnum <int> 18, 18, 18, 18, 55, 55, 55, 14, 36, 36, 35, 32, 9, 1, 32, 18,…
## $ flag <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcttest <int> 99, 100, 99, 100, 100, 100, 99, 98, 100, 100, 99, 99, 93, 95,…
## $ api00 <int> 840, 516, 531, 501, 720, 805, 778, 731, 592, 669, 496, 505, 4…
## $ api99 <int> 816, 476, 544, 457, 659, 780, 787, 731, 508, 658, 479, 499, 4…
## $ target <int> NA, 16, 13, 17, 7, 1, 1, 3, 15, 7, 16, 15, 17, 20, 13, 18, 11…
## $ growth <int> 24, 40, -13, 44, 61, 25, -9, 0, 84, 11, 17, 6, 7, 3, -10, 57,…
## $ sch.wide <fct> Yes, Yes, No, Yes, Yes, Yes, No, No, Yes, Yes, Yes, No, No, N…
## $ comp.imp <fct> No, Yes, No, Yes, Yes, Yes, No, No, Yes, No, No, No, No, No, …
## $ both <fct> No, Yes, No, Yes, Yes, Yes, No, No, Yes, No, No, No, No, No, …
## $ awards <fct> No, Yes, No, Yes, Yes, Yes, No, No, Yes, No, No, No, No, No, …
## $ meals <int> 33, 98, 64, 83, 26, 7, 9, 45, 75, 47, 69, 60, 66, 54, 35, 95,…
## $ ell <int> 25, 77, 23, 63, 17, 0, 2, 2, 58, 23, 25, 10, 43, 26, 7, 66, 7…
## $ yr.rnd <fct> No, Yes, No, No, No, No, No, No, Yes, No, No, No, No, No, No,…
## $ mobility <int> 11, 26, 17, 13, 31, 12, 10, 15, 23, 19, 26, 22, 16, 44, 18, 1…
## $ acs.k3 <int> 20, 19, 20, 17, 20, 19, 19, 19, 20, 18, NA, NA, NA, NA, NA, 1…
## $ acs.46 <int> 29, 28, 30, 30, 30, 29, 31, 31, 32, 29, 32, 32, NA, 32, NA, 3…
## $ acs.core <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 30, 32, 27, 29, 28, N…
## $ pct.resp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 87, 67, 50, 70, 71, 2, 91, 0…
## $ not.hsg <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 31, 49, 12, 20, 45, 9, 22, 0…
## $ hsg <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 34, 20, 33, 20, 36, 64, 20, …
## $ some.col <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 15, 23, 31, 11, 18, 32, …
## $ col.grad <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 12, 29, 23, 8, 9, 16, 10…
## $ grad.sch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 3, 6, 0, 0, 11, 0, 2, …
## $ avg.ed <dbl> 3.32, 1.67, 2.34, 1.86, 3.17, 3.64, 3.55, 3.10, 2.17, 2.82, 2…
## $ full <int> 100, 57, 81, 64, 90, 95, 96, 93, 91, 96, 84, 65, 93, 55, 80, …
## $ emer <int> 0, 40, 26, 24, 7, 0, 0, 8, 14, 0, 18, 37, 17, 26, 19, 33, 3, …
## $ enroll <int> 276, 841, 441, 298, 354, 330, 385, 583, 763, 381, 1293, 1043,…
## $ api.stu <int> 241, 631, 415, 288, 319, 315, 363, 510, 652, 322, 1035, 815, …
## $ pw <dbl> 44.21, 44.21, 44.21, 44.21, 44.21, 44.21, 44.21, 44.21, 44.21…
## $ fpc <dbl> 4421, 4421, 4421, 4421, 4421, 4421, 4421, 4421, 4421, 4421, 1…
# Summarize strata sample sizes
apistrat %>%
  count(stype)
## stype n
## 1 E 100
## 2 H 50
## 3 M 50
# Specify the design
apistrat_design <- svydesign(data = apistrat,
                             weights = ~pw,
                             fpc = ~fpc,
                             id = ~1,
                             strata = ~stype)
# Look at the summary information stored in the design object
summary(apistrat_design)
## Stratified Independent Sampling design
## svydesign(data = apistrat, weights = ~pw, fpc = ~fpc, id = ~1,
## strata = ~stype)
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02262 0.02262 0.03587 0.04014 0.05339 0.06623
## Stratum Sizes:
## E H M
## obs 100 50 50
## design.PSU 100 50 50
## actual.PSU 100 50 50
## Population stratum sizes (PSUs):
## E H M
## 4421 755 1018
## Data variables:
## [1] "cds" "stype" "name" "sname" "snum" "dname"
## [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
## [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
## [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
## [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
## [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
## [37] "api.stu" "pw" "fpc"
Now let’s practice specifying a cluster sampling design, using the dataset apiclus2. The schools were clustered based on school districts, dnum. Within a sampled school district, 5 schools were randomly selected for the sample. The schools are denoted by snum. The number of districts is given by fpc1, and the number of schools in the sampled districts is given by fpc2.
# Glimpse the data
glimpse(apiclus2)
## Rows: 126
## Columns: 40
## $ cds <chr> "31667796031017", "55751846054837", "41688746043517", "416887…
## $ stype <fct> E, E, E, M, E, E, E, E, M, H, E, M, E, E, E, E, H, E, E, M, E…
## $ name <chr> "Alta-Dutch Flat", "Tenaya Elementa", "Panorama Elemen", "Lip…
## $ sname <chr> "Alta-Dutch Flat Elementary", "Tenaya Elementary", "Panorama …
## $ snum <dbl> 3269, 5979, 4958, 4957, 4956, 4915, 2548, 2550, 2549, 348, 34…
## $ dname <chr> "Alta-Dutch Flat Elem", "Big Oak Flat-Grvlnd Unif", "Brisbane…
## $ dnum <int> 15, 63, 83, 83, 83, 117, 132, 132, 132, 152, 152, 152, 173, 1…
## $ cname <chr> "Placer", "Tuolumne", "San Mateo", "San Mateo", "San Mateo", …
## $ cnum <int> 30, 54, 40, 40, 40, 39, 19, 19, 19, 5, 5, 5, 36, 36, 36, 36, …
## $ flag <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcttest <int> 100, 100, 98, 100, 98, 100, 100, 100, 100, 96, 98, 100, 100, …
## $ api00 <int> 821, 773, 600, 740, 716, 811, 472, 520, 568, 591, 544, 612, 9…
## $ api99 <int> 785, 718, 632, 740, 711, 779, 432, 494, 589, 585, 554, 583, 9…
## $ target <int> 1, 4, 8, 3, 4, 1, 18, 15, 11, 11, 12, 11, NA, NA, NA, NA, 18,…
## $ growth <int> 36, 55, -32, 0, 5, 32, 40, 26, -21, 6, -10, 29, 14, 2, 30, -5…
## $ sch.wide <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes, Yes, Y…
## $ comp.imp <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes, Yes, Y…
## $ both <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes, Yes, Y…
## $ awards <fct> Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No, No, Yes, Yes, Y…
## $ meals <int> 27, 43, 33, 11, 5, 25, 78, 76, 68, 42, 63, 54, 0, 4, 1, 6, 47…
## $ ell <int> 0, 0, 5, 4, 2, 5, 38, 34, 34, 23, 42, 24, 3, 6, 2, 1, 37, 14,…
## $ yr.rnd <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ mobility <int> 14, 12, 9, 8, 6, 19, 13, 13, 15, 4, 15, 15, 24, 19, 14, 14, 7…
## $ acs.k3 <int> 17, 18, 19, NA, 18, 20, 19, 25, NA, NA, 20, NA, 19, 18, 19, 1…
## $ acs.46 <int> 20, 34, 29, 30, 28, 22, NA, 23, 24, NA, NA, 27, 27, 25, 27, 2…
## $ acs.core <int> NA, NA, NA, 24, NA, 31, NA, NA, 25, 21, NA, 18, NA, NA, NA, N…
## $ pct.resp <int> 89, 98, 79, 96, 98, 93, 100, 46, 91, 94, 93, 88, 90, 99, 0, 8…
## $ not.hsg <int> 4, 8, 8, 5, 3, 5, 48, 30, 63, 20, 29, 27, 0, 1, 0, 1, 50, 24,…
## $ hsg <int> 16, 33, 28, 27, 14, 9, 32, 27, 16, 18, 32, 25, 0, 7, 0, 5, 21…
## $ some.col <int> 53, 37, 30, 35, 22, 30, 15, 21, 13, 27, 26, 24, 4, 8, 0, 8, 1…
## $ col.grad <int> 21, 15, 32, 27, 58, 37, 4, 13, 6, 28, 7, 18, 51, 42, 0, 42, 1…
## $ grad.sch <int> 6, 7, 1, 6, 3, 19, 1, 9, 2, 7, 6, 7, 44, 41, 100, 45, 1, 6, 3…
## $ avg.ed <dbl> 3.07, 2.79, 2.90, 3.03, 3.44, 3.56, 1.77, 2.42, 1.68, 2.84, 2…
## $ full <int> 100, 100, 100, 82, 100, 94, 96, 86, 75, 100, 100, 97, 100, 10…
## $ emer <int> 0, 0, 0, 18, 8, 6, 8, 24, 21, 4, 4, 3, 0, 4, 0, 4, 28, 18, 11…
## $ enroll <int> 152, 312, 173, 201, 147, 234, 184, 512, 543, 332, 217, 520, 5…
## $ api.stu <int> 120, 270, 151, 179, 136, 189, 158, 419, 423, 303, 182, 438, 4…
## $ pw <dbl> 18.925, 18.925, 18.925, 18.925, 18.925, 18.925, 18.925, 18.92…
## $ fpc1 <dbl> 757, 757, 757, 757, 757, 757, 757, 757, 757, 757, 757, 757, 7…
## $ fpc2 <int> <array[26]>
# Specify the two-stage cluster design
apiclus_design <- svydesign(id = ~dnum + snum,
                            data = apiclus2,
                            weights = ~pw,
                            fpc = ~fpc1 + fpc2)
# Look at the summary information stored in the design object
summary(apiclus_design)
## 2 - level Cluster Sampling design
## With (40, 126) clusters.
## svydesign(id = ~dnum + snum, data = apiclus2, weights = ~pw,
## fpc = ~fpc1 + fpc2)
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003669 0.037743 0.052840 0.042390 0.052840 0.052840
## Population size (PSUs): 757
## Data variables:
## [1] "cds" "stype" "name" "sname" "snum" "dname"
## [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
## [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
## [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
## [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
## [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
## [37] "api.stu" "pw" "fpc1" "fpc2"
Remember that an observation’s survey weight tells us how many population units that observation represents. The weights are constructed based on the sampling design. Let’s compare the weights for the three samples of the api dataset. For example, in simple random sampling each unit has an equal chance of being sampled, so each observation gets an equal survey weight. For stratified and cluster sampling, on the other hand, units have unequal chances of being sampled, and that is reflected in the survey weights.
# Construct histogram of pw for the simple random sample, apisrs
ggplot(apisrs, aes(pw)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Construct histogram of pw for the stratified sample, apistrat
ggplot(apistrat, aes(pw)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Construct histogram of pw for the cluster sample, apiclus2
ggplot(apiclus2, aes(pw)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
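To complement the histograms, a small numeric summary (a sketch) makes the contrast explicit: the SRS weights are constant, while the stratified and cluster weights vary:
# Compare the range of the weights across the three samples
bind_rows(
  apisrs   %>% summarize(design = "SRS",        min_wt = min(pw), max_wt = max(pw)),
  apistrat %>% summarize(design = "Stratified", min_wt = min(pw), max_wt = max(pw)),
  apiclus2 %>% summarize(design = "Cluster",    min_wt = min(pw), max_wt = max(pw))
)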
Let’s explore the NHANES survey weights. As we said, minorities are more likely to be sampled. To account for this, their survey weights will be smaller, since each sampled person represents fewer people in the population. Let’s see if this is reflected in the data.
# Recreating some computations from the course slides
dim(NHANESraw)
## [1] 20293 78
NHANESraw %>%
  summarize(N_hat = sum(WTMEC2YR))
## # A tibble: 1 × 1
## N_hat
## <dbl>
## 1 608534400.
# NHANESraw spans two 2-year survey cycles, so halve the 2-year weights to get 4-year weights
NHANESraw <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR/2)
NHANESraw %>%
  summarize(N_hat = sum(WTMEC4YR))
## # A tibble: 1 × 1
## N_hat
## <dbl>
## 1 304267200.
# Create a table of average survey weights by race
tab_weights <- NHANESraw %>%
  group_by(Race1) %>%
  summarize(avg_wt = mean(WTMEC4YR))
# Print the table
tab_weights
## # A tibble: 5 × 2
## Race1 avg_wt
## <fct> <dbl>
## 1 Black 8026.
## 2 Hispanic 8579.
## 3 Mexican 8216.
## 4 White 26236.
## 5 Other 10116.
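A quick bar chart of tab_weights (a sketch) makes the contrast easy to see: the average weight for the White group is roughly three times that of the oversampled groups:
# Visualize the average survey weight by race
ggplot(tab_weights, aes(Race1, avg_wt)) +
  geom_col() +
  labs(y = "Average survey weight")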
Now let’s dig into the NHANES design a little deeper so that we are prepared to analyze these data! Notice that I have re-created NHANES_design, which we created in the last video. Remember that it contains the design information for NHANESraw. The two important design variables in NHANESraw are SDMVSTRA, which contains the strata assignment for each unit, and SDMVPSU, which contains the cluster id within a given stratum. Remember that the clusters are nested inside the strata!
# Specify the NHANES design
NHANES_design <- svydesign(data = NHANESraw,
                           strata = ~SDMVSTRA,
                           id = ~SDMVPSU,
                           nest = TRUE,
                           weights = ~WTMEC4YR)
distinct(NHANESraw, SDMVPSU)
## # A tibble: 3 × 1
## SDMVPSU
## <int>
## 1 1
## 2 2
## 3 3
# Print summary of design
summary(NHANES_design)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (62) clusters.
## svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU,
## nest = TRUE, weights = ~WTMEC4YR)
## Probabilities:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.986e-06 5.664e-05 1.054e-04 Inf 1.721e-04 Inf
## Stratum Sizes:
## 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
## obs 803 785 823 829 696 751 696 724 713 683 592 946 598 647 251 862 998
## design.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3
## actual.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3
## 92 93 94 95 96 97 98 99 100 101 102 103
## obs 875 602 688 722 676 608 708 682 700 715 624 296
## design.PSU 3 2 2 2 2 2 2 2 2 2 2 2
## actual.PSU 3 2 2 2 2 2 2 2 2 2 2 2
## Data variables:
## [1] "ID" "SurveyYr" "Gender" "Age"
## [5] "AgeMonths" "Race1" "Race3" "Education"
## [9] "MaritalStatus" "HHIncome" "HHIncomeMid" "Poverty"
## [13] "HomeRooms" "HomeOwn" "Work" "Weight"
## [17] "Length" "HeadCirc" "Height" "BMI"
## [21] "BMICatUnder20yrs" "BMI_WHO" "Pulse" "BPSysAve"
## [25] "BPDiaAve" "BPSys1" "BPDia1" "BPSys2"
## [29] "BPDia2" "BPSys3" "BPDia3" "Testosterone"
## [33] "DirectChol" "TotChol" "UrineVol1" "UrineFlow1"
## [37] "UrineVol2" "UrineFlow2" "Diabetes" "DiabetesAge"
## [41] "HealthGen" "DaysPhysHlthBad" "DaysMentHlthBad" "LittleInterest"
## [45] "Depressed" "nPregnancies" "nBabies" "Age1stBaby"
## [49] "SleepHrsNight" "SleepTrouble" "PhysActive" "PhysActiveDays"
## [53] "TVHrsDay" "CompHrsDay" "TVHrsDayChild" "CompHrsDayChild"
## [57] "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear" "SmokeNow"
## [61] "Smoke100" "SmokeAge" "Marijuana" "AgeFirstMarij"
## [65] "RegularMarij" "AgeRegMarij" "HardDrugs" "SexEver"
## [69] "SexAge" "SexNumPartnLife" "SexNumPartYear" "SameSex"
## [73] "SexOrientation" "WTINT2YR" "WTMEC2YR" "SDMVPSU"
## [77] "SDMVSTRA" "PregnantNow" "WTMEC4YR"
# Number of clusters
NHANESraw %>%
  summarize(n_clusters = n_distinct(SDMVSTRA, SDMVPSU))
## # A tibble: 1 × 1
## n_clusters
## <int>
## 1 62
# Sample sizes in clusters
NHANESraw %>%
  count(SDMVSTRA, SDMVPSU)
## # A tibble: 62 × 3
## SDMVSTRA SDMVPSU n
## <int> <int> <int>
## 1 75 1 379
## 2 75 2 424
## 3 76 1 419
## 4 76 2 366
## 5 77 1 441
## 6 77 2 382
## 7 78 1 378
## 8 78 2 451
## 9 79 1 349
## 10 79 2 347
## # … with 52 more rows
Now it’s time to start analyzing survey data!
Now that we have a handle on survey weights, we will practice incorporating those weights into our analysis of categorical data in this chapter. We’ll conduct descriptive inference by calculating summary statistics, building summary tables, and constructing bar graphs. For analytic inference, we will learn to run chi-squared tests.
Throughout this chapter and the next two, you will use the NHANES dataset from the NHANES R package. The data are collected by the Centers for Disease Control and Prevention (CDC, the national public health institute in the United States) and can be thought of as a random sample of US residents.
You can check the NHANES documentation for details on each variable.
The NHANES variable Depressed gives the self-reported frequency with which a participant felt depressed. It is only reported for participants aged 18 years or older. Let’s explore the distribution of depression in the US.
# Specify the survey design (previously worked out)
NHANESraw <- NHANESraw %>%
  mutate(WTMEC4YR = WTMEC2YR/2)
NHANES_design <- svydesign(data = NHANESraw,
                           strata = ~SDMVSTRA,
                           id = ~SDMVPSU,
                           nest = TRUE,
                           weights = ~WTMEC4YR)
# Determine the levels of Depressed
levels(NHANESraw$Depressed)
## [1] "None" "Several" "Most"
# Construct a frequency table of Depressed
tab_w <- svytable(~Depressed, design = NHANES_design)
# Determine class of tab_w
class(tab_w)
## [1] "svytable" "xtabs" "table"
# Display tab_w
tab_w
## Depressed
## None Several Most
## 158758609 32732508 12704441
Good work! Now let’s interpret our frequency table.
A frequency table contains the counts of each category of a variable. Print out tab_w and table(NHANESraw$Depressed), which are both frequency tables but have different interpretations. Based on these tables, which of the following statements is FALSE?
tab_w
## Depressed
## None Several Most
## 158758609 32732508 12704441
table(NHANESraw$Depressed)
##
## None Several Most
## 7926 1774 814
Possible Answers:
- We estimate that over 12 million US adults feel depressed most days.
- [x] Over 158 million sampled adults don’t feel depressed.
- 814 people said they feel depressed most days.
- tab_w contains the estimated frequency of self-reported depression for all US adults.
Good catch! tab_w isn’t telling us about the sample but about the population!
It is a bit hard to compare the frequencies in tab_w since each frequency is a very large number. Let’s add proportions to the table and produce a bar graph to better understand the distribution of depression.
# Add proportions to the table
tab_w <- tab_w %>%
  as_tibble() %>%
  mutate(Prop = n/sum(n))
# Create a barplot
ggplot(tab_w, aes(Depressed, Prop)) +
  geom_col() +
  coord_flip() +
  scale_x_discrete(limits = tab_w$Depressed)
Nice graphic! Now it is time to roll another variable into our analyses.
### Creating contingency tables
Let’s practice constructing contingency tables with svytable(). HealthGen is a categorical variable of the participant’s self-reported health for participants aged 12 years and older. Let’s look at the relationship between HealthGen and Depressed.
# Construct and display frequency tables: one per variable, then the two-way table
tab_D <- svytable(~Depressed,
                  design = NHANES_design)
tab_H <- svytable(~HealthGen,
                  design = NHANES_design)
tab_DH <- svytable(~Depressed + HealthGen,
                   design = NHANES_design)
Great job! Now let’s explore this relationship further.
“What is the probability that a person in excellent health suffers from depression?” and “Are depression rates lower for healthier people?” are questions we can start to answer by adding conditional proportions to tab_DH and creating a segmented bar graph. Notice that we need to use ungroup() in our data wrangling below.
tab_DH_cond <- tab_DH %>%
  as_tibble() %>%
  group_by(HealthGen) %>%
  mutate(n_healthGen = sum(n),
         Prop_Depressed = n/sum(n)) %>%
  ungroup()
tab_DH_cond
## # A tibble: 15 × 5
## Depressed HealthGen n n_healthGen Prop_Depressed
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 None Excellent 21327182. 23761416. 0.898
## 2 Several Excellent 1870621. 23761416. 0.0787
## 3 Most Excellent 563613. 23761416. 0.0237
## 4 None Vgood 57487318. 67645678. 0.850
## 5 Several Vgood 8302494. 67645678. 0.123
## 6 Most Vgood 1855865. 67645678. 0.0274
## 7 None Good 59920032. 78569449. 0.763
## 8 Several Good 13950469. 78569449. 0.178
## 9 Most Good 4698948. 78569449. 0.0598
## 10 None Fair 17690783. 28981393. 0.610
## 11 Several Fair 7355105. 28981393. 0.254
## 12 Most Fair 3935506. 28981393. 0.136
## 13 None Poor 2324945. 5229274. 0.445
## 14 Several Poor 1253820. 5229274. 0.240
## 15 Most Poor 1650510. 5229274. 0.316
# Create a segmented bar graph of the conditional proportions in tab_DH_cond
ggplot(tab_DH_cond, aes(HealthGen, Prop_Depressed, fill = as_factor(Depressed))) +
  geom_col() +
  coord_flip() +
  guides(fill = guide_legend(title = "Depressed"))
Based on your table and graphic, does there appear to be an association?
We can also estimate counts with svytotal(). The syntax is given by:

svytotal(x = ~interaction(Var1, Var2),
         design = design,
         na.rm = TRUE)
For each combination of the two variables, we get an estimate of the total and the standard error.
# Estimate the totals for combos of Depressed and HealthGen
tab_totals <- svytotal(x = ~interaction(Depressed, HealthGen),
                       design = NHANES_design,
                       na.rm = TRUE)
# Print table of totals
tab_totals
## total SE
## interaction(Depressed, HealthGen)None.Excellent 21327182 1556268
## interaction(Depressed, HealthGen)Several.Excellent 1870621 277198
## interaction(Depressed, HealthGen)Most.Excellent 563613 139689
## interaction(Depressed, HealthGen)None.Vgood 57487319 2975806
## interaction(Depressed, HealthGen)Several.Vgood 8302495 687020
## interaction(Depressed, HealthGen)Most.Vgood 1855865 269970
## interaction(Depressed, HealthGen)None.Good 59920032 3375068
## interaction(Depressed, HealthGen)Several.Good 13950469 931077
## interaction(Depressed, HealthGen)Most.Good 4698948 501105
## interaction(Depressed, HealthGen)None.Fair 17690783 1206307
## interaction(Depressed, HealthGen)Several.Fair 7355105 455364
## interaction(Depressed, HealthGen)Most.Fair 3935506 370256
## interaction(Depressed, HealthGen)None.Poor 2324945 251934
## interaction(Depressed, HealthGen)Several.Poor 1253820 168440
## interaction(Depressed, HealthGen)Most.Poor 1650510 195136
# Estimate the means for combos of Depressed and HealthGen
tab_means <- svymean(x = ~interaction(Depressed, HealthGen),
                     design = NHANES_design,
                     na.rm = TRUE)
# Print table of means
tab_means
## mean SE
## interaction(Depressed, HealthGen)None.Excellent 0.1044492 0.0053
## interaction(Depressed, HealthGen)Several.Excellent 0.0091613 0.0014
## interaction(Depressed, HealthGen)Most.Excellent 0.0027603 0.0007
## interaction(Depressed, HealthGen)None.Vgood 0.2815422 0.0078
## interaction(Depressed, HealthGen)Several.Vgood 0.0406612 0.0028
## interaction(Depressed, HealthGen)Most.Vgood 0.0090890 0.0013
## interaction(Depressed, HealthGen)None.Good 0.2934563 0.0092
## interaction(Depressed, HealthGen)Several.Good 0.0683220 0.0033
## interaction(Depressed, HealthGen)Most.Good 0.0230129 0.0023
## interaction(Depressed, HealthGen)None.Fair 0.0866400 0.0047
## interaction(Depressed, HealthGen)Several.Fair 0.0360214 0.0026
## interaction(Depressed, HealthGen)Most.Fair 0.0192740 0.0019
## interaction(Depressed, HealthGen)None.Poor 0.0113863 0.0013
## interaction(Depressed, HealthGen)Several.Poor 0.0061405 0.0009
## interaction(Depressed, HealthGen)Most.Poor 0.0080833 0.0010
Good survey summarizing!
Can you explain the output of svymean()? Print out the svymean() object we created in the last exercise, tab_means. Which of the following is a correct interpretation of the first numeric entry?
Possible Answers:
- We estimate that 10.4% of people are not depressed.
- We estimate that 10.4% of people believe their health is excellent.
- [x] We estimate that 10.4% of people are not depressed and believe their health is excellent.
- Only 10.4% of sampled participants answered depression and perceived health questions.
Nice work! Each entry is estimating the proportion of the population for a combination of the variables.
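As a quick sanity check (a sketch), the 15 estimated cell proportions partition the population, so extracting them with coef() and summing should give 1:
# The estimated cell proportions should sum to 1
sum(coef(tab_means))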
Now it’s time to test whether or not there is an association between depression level and perceived health using a chi-squared test. Recall that the p-value of the chi-squared test tells us how consistent our sample results are with the assumption that depression and perceived health are not related.
# Run a chi-squared test between Depressed and HealthGen
svychisq(~Depressed + HealthGen,
         design = NHANES_design,
         statistic = "Chisq")
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~Depressed + HealthGen, design = NHANES_design, statistic = "Chisq")
## X-squared = 1592.7, df = 8, p-value < 2.2e-16
Good work! Based on your test, does there appear to be a relationship between depression level and perceived health?
Now let’s look at a new example to practice what we learned this chapter! Let’s ask “Does homeownership vary by education level?” To answer this question, I want you to create a contingency table, visualize differences with a segmented bar graph, and then run a chi-squared test.
# Construct a contingency table
tab <- svytable(~HomeOwn + Education,
                design = NHANES_design)
# Add the conditional proportions of the levels of HomeOwn for each education level
tab_df <- as_tibble(tab) %>%
  group_by(Education) %>%
  mutate(n_Education = sum(n),
         Prop_HomeOwn = n/n_Education) %>%
  ungroup()
# Create a segmented bar graph
ggplot(tab_df, aes(Education, Prop_HomeOwn, fill = as_factor(HomeOwn))) +
  geom_col() +
  coord_flip()
# Run a chi-squared test
svychisq(~HomeOwn + Education,
         design = NHANES_design,
         statistic = "Chisq")
##
## Pearson's X^2: Rao & Scott adjustment
##
## data: svychisq(~HomeOwn + Education, design = NHANES_design, statistic = "Chisq")
## X-squared = 531.78, df = 8, p-value = 2.669e-16
Well done! I think we are ready to move on to when our survey variable is quantitative.
Of course, not all survey data are categorical, so in this chapter we will explore analyzing quantitative survey data. We will learn to compute survey-weighted statistics, such as the mean and quantiles. For data visualization, we’ll construct bar graphs, histograms, and density plots. We will close out the chapter by conducting analytic inference with survey-weighted t-tests.
How many hours of sleep does the average person in the US get each night? Does sleep vary by gender? Let’s explore these questions using the variables Gender and SleepHrsNight. SleepHrsNight contains the self-reported number of hours a participant usually sleeps on a weeknight and is only reported for participants who were at least 16 years old.
# Compute the survey-weighted mean
svymean(x = ~SleepHrsNight,
        design = NHANES_design,
        na.rm = TRUE)
## mean SE
## SleepHrsNight 6.9292 0.0166
# Compute the survey-weighted mean by Gender
svyby(formula = ~SleepHrsNight,
      by = ~Gender,
      design = NHANES_design,
      FUN = svymean,
      na.rm = TRUE,
      keep.names = FALSE)
## Gender SleepHrsNight se
## 1 female 6.976103 0.02374684
## 2 male 6.879050 0.01953263
Nice summaries! How does your sleep compare to the national average??
In addition to the mean, we may also be interested in a quantile, such as the 0.5 quantile (i.e., the median). Recall that the median is the value of the variable that is greater than 50 percent of the data. By exploring the quantiles of the data, we can get a better sense of how sleep hours are distributed across Americans. While the median tells us about the middle of the population, we might also want to know how much sleep the bottom 1% (the 0.01 quantile) or the top 1% (the 0.99 quantile) typically get.
# Compute the survey-weighted quantiles
svyquantile(x = ~SleepHrsNight,
            design = NHANES_design,
            na.rm = TRUE,
            quantiles = c(0.01, 0.25, 0.5, 0.75, 0.99))
## $SleepHrsNight
## quantile ci.2.5 ci.97.5 se
## 0.01 4 4 5 0.2457588
## 0.25 6 6 7 0.2457588
## 0.5 7 7 8 0.2457588
## 0.75 8 8 9 0.2457588
## 0.99 10 10 11 0.2457588
##
## attr(,"hasci")
## [1] TRUE
## attr(,"class")
## [1] "newsvyquantile"
# Compute the survey-weighted quantiles by Gender
svyby(formula = ~SleepHrsNight,
      by = ~Gender,
      design = NHANES_design,
      FUN = svyquantile,
      na.rm = TRUE,
      quantiles = 0.5,
      keep.rows = FALSE,
      keep.var = FALSE)
## Gender statistic.statistic.SleepHrsNight.quantile
## female female 7
## male male 7
## statistic.statistic.SleepHrsNight.ci.2.5
## female 7
## male 7
## statistic.statistic.SleepHrsNight.ci.97.5
## female 8
## male 8
## statistic.statistic.SleepHrsNight.se
## female 0.2457588
## male 0.2457588
You clearly aren’t sleeping on the job!
Let’s return to the dataframe of the average sleep per night by gender, which we created in Exercise 3.2. I have saved the dataframe as out in the code chunk. It’s time to graph these data!
# Compute the survey-weighted mean by Gender
out <- svyby(formula = ~SleepHrsNight,
             by = ~Gender,
             design = NHANES_design,
             FUN = svymean,
             na.rm = TRUE,
             keep.names = FALSE)
# Construct a bar plot of average sleep by gender
ggplot(out, aes(Gender, SleepHrsNight)) +
  geom_col() +
  labs(y = "Average Nightly Sleep")
Nice bar plot! Shall we add in uncertainty?
Both out and NHANESraw contain columns called SleepHrsNight and Gender. Print out and select(NHANESraw, SleepHrsNight, Gender).
out
## Gender SleepHrsNight se
## 1 female 6.976103 0.02374684
## 2 male 6.879050 0.01953263
select(NHANESraw, SleepHrsNight, Gender) # or NHANESraw %>% select(SleepHrsNight, Gender)
## # A tibble: 20,293 × 2
## SleepHrsNight Gender
## <int> <fct>
## 1 4 male
## 2 NA male
## 3 8 male
## 4 NA male
## 5 4 female
## 6 4 male
## 7 8 female
## 8 NA female
## 9 NA male
## 10 6 male
## # … with 20,283 more rows
Which of the following statements is incorrect?
You are right! out doesn’t contain the most common value but estimates of the average value!
In the video, we saw how to incorporate error bars into our bar plot to give the viewer a sense of the uncertainty in our estimates. We used error bars of +/- 1 standard error, but it’s also common to use +/- 2 standard errors since that roughly represents a confidence interval for the quantity of interest.
# Add lower and upper columns to out
out_col <- mutate(out,
                  lower = SleepHrsNight - 2*se,
                  upper = SleepHrsNight + 2*se)
# Construct a bar plot of average sleep by gender with error bars
ggplot(out_col, aes(x = Gender, y = SleepHrsNight,
                    ymin = lower, ymax = upper)) +
  geom_col(fill = "gold") +
  labs(y = "Average Nightly Sleep") +
  geom_errorbar(width = 0.7)
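As an alternative to the ±2 SE approximation, the svyby object out also supports confint(), which returns design-based 95% confidence intervals directly; a sketch:
# Design-based 95% confidence intervals for the group means
confint(out)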
Now that you can graph the means, let’s explore the shape of our data!
If we want to understand the distribution of nightly sleep for Americans, we should construct a histogram. Let’s do that now and see the impact of changing the width of the bins.
# Create a histogram with a set binwidth of 1
ggplot(NHANESraw, aes(SleepHrsNight, weight = WTMEC4YR)) +
  geom_histogram(binwidth = 1,
                 color = "white") +
  labs(x = "Hours of Sleep")
## Warning: Removed 7261 rows containing non-finite values (stat_bin).
# Create a histogram with a set binwidth of 0.5
ggplot(NHANESraw, aes(SleepHrsNight, weight = WTMEC4YR)) +
  geom_histogram(binwidth = 0.5,
                 color = "white") +
  labs(x = "Hours of Sleep")
## Warning: Removed 7261 rows containing non-finite values (stat_bin).
Why are there now gaps in the histogram when the binwidth is changed to 0.5? (SleepHrsNight is reported in whole hours, so half-hour bins between the integers are empty.)
# Create a histogram with a set binwidth of 2
ggplot(NHANESraw, aes(SleepHrsNight, weight = WTMEC4YR)) +
  geom_histogram(binwidth = 2,
                 color = "white") +
  labs(x = "Hours of Sleep")
## Warning: Removed 7261 rows containing non-finite values (stat_bin).
How did the shape of the histogram change?
Such beautifully bell-shaped histograms!
We can also explore the shape of a variable with the smooth curve of a density plot. Let’s create a density plot of nightly sleep where we facet by gender. Whereas the height of the histogram bars represents counts, the height of the density curve represents probabilities. Therefore, we will need to do some data wrangling (standardizing the weights within each gender) before we create the plot.
# Density plot of sleep faceted by gender
NHANESraw %>%
  filter(!is.na(SleepHrsNight), !is.na(Gender)) %>%
  group_by(Gender) %>%
  mutate(WTMEC4YR_std = WTMEC4YR/sum(WTMEC4YR)) %>%
  ggplot(aes(x = SleepHrsNight, weight = WTMEC4YR_std)) +
  geom_density(bw = 0.6, fill = "gold") +
  labs(x = "Hours of Sleep") +
  facet_wrap(~Gender, labeller = "label_both")
What nice smooth density curves!
Now let’s tackle the question “Is there evidence that the average hours of nightly sleep differ by gender?” with a t-test.
# Run a survey-weighted t-test
svyttest(formula = SleepHrsNight ~ Gender,
         design = NHANES_design)
## Warning in summary.glm(g): observations with zero weight not used for
## calculating dispersion
## Warning in summary.glm(glm.object): observations with zero weight not used for
## calculating dispersion
##
## Design-based t-test
##
## data: SleepHrsNight ~ Gender
## t = -3.4077, df = 32, p-value = 0.001785
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
## -0.15506427 -0.03904047
## sample estimates:
## difference in mean
## -0.09705237
Great work! What conclusions can you draw from your t-test?
Now let’s switch to a new question to practice some of the skills we learned in this chapter. Let’s investigate whether or not the total cholesterol varies, on average, between physically active and inactive Americans.
# Find means of total cholesterol by whether or not active
out <- svyby(formula = ~TotChol,
             by = ~PhysActive,
             design = NHANES_design,
             FUN = svymean,
             na.rm = TRUE,
             keep.names = FALSE)
# Construct a bar plot of means of total cholesterol by whether or not active
ggplot(out, aes(PhysActive, TotChol)) +
  geom_col()
# Run a t-test for a difference in means of total cholesterol by whether or not active
svyttest(formula = TotChol ~ PhysActive,
         design = NHANES_design)
##
## Design-based t-test
##
## data: TotChol ~ PhysActive
## t = -3.7936, df = 32, p-value = 0.0006232
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
## -0.20321950 -0.06122666
## sample estimates:
## difference in mean
## -0.1322231
Good work bringing together all these new skills! Let’s move on to modeling.
Modeling survey data also requires careful consideration of how the data were collected. We will start our modeling chapter by learning how to incorporate survey weights into scatter plots through aesthetics such as size, color, and transparency. We’ll model the survey data with linear regression and explore how to incorporate categorical predictors and polynomial terms into our models.
Let’s explore the relationship between Height and Weight of Americans. We will focus on 20 year olds to minimize the effects of overplotting. We want to compare the look of the standard scatter plot to a bubble plot.
# Create a dataset with only 20 year olds
NHANES20 <- NHANESraw %>%
  filter(Age == 20)
# Construct scatter plot
ggplot(NHANES20, aes(Height, Weight)) +
  geom_point(alpha = 0.3) +
  guides(size = "none")
## Warning: Removed 4 rows containing missing values (geom_point).
# Construct bubble plot
ggplot(NHANES20, aes(Height, Weight, size = WTMEC4YR)) +
  geom_point(alpha = 0.3) +
  guides(size = "none")
## Warning: Removed 4 rows containing missing values (geom_point).
How do the plots compare? What additional information can be gleaned from the bubble plot?
In scatter plots, we can also account for the sampling design through the saturation of color or the transparency of the points. Let’s see how these forms change the look of our scatter plot of height versus weight among 20 year olds.
# Scatter plot with color saturation mapped to the survey weight
ggplot(NHANES20, aes(Height, Weight, color = WTMEC4YR)) +
  geom_point() +
  guides(color = "none")
## Warning: Removed 4 rows containing missing values (geom_point).
# Scatter plot with transparency mapped to the survey weight
ggplot(NHANES20, aes(Height, Weight, alpha = WTMEC4YR)) +
  geom_point() +
  guides(alpha = "none")
## Warning: Removed 4 rows containing missing values (geom_point).
Great scatter plotting! Do you think size, color, or transparency works the best at adding the survey weight information to the graphic?
We often use the hue of color in a scatter plot to represent categories of a categorical variable. Let’s add Gender to our scatter plots.
# Add gender to the bubble plot
ggplot(NHANES20, aes(Height, Weight, size = WTMEC4YR, color = Gender)) +
  geom_point(alpha = 0.3) +
  guides(size = "none")
## Warning: Removed 4 rows containing missing values (geom_point).
# Add gender to the transparency plot (setting alpha inside geom_point()
# would override the aes() mapping, so leave it unset here)
ggplot(NHANES20, aes(Height, Weight, alpha = WTMEC4YR, color = Gender)) +
  geom_point() +
  guides(alpha = "none")
## Warning: Removed 4 rows containing missing values (geom_point).
What do your plots tell you about how the relationship between height and weight does or does not differ by gender?
What’s the relationship between height and weight across all ages? In this exercise, we will add the line of best fit to a bubble plot and see that the trend isn’t very linear. We can create a polynomial curve by adding the argument formula = y ~ poly(x, d) to geom_smooth(), where d is the degree of the polynomial. For example, a d of 2 adds a quadratic and a d of 3 adds a cubic.
# Bubble plot with line of best fit
ggplot(NHANESraw, aes(Height, Weight, size = WTMEC4YR)) +
  geom_point(alpha = 0.1) +
  guides(size = "none") +
  geom_smooth(method = "lm", se = FALSE, aes(weight = WTMEC4YR)) +
  # Add a quadratic curve and a cubic curve
  geom_smooth(method = "lm", se = FALSE, aes(weight = WTMEC4YR),
              formula = y ~ poly(x, 2), color = "orange") +
  geom_smooth(method = "lm", se = FALSE, aes(weight = WTMEC4YR),
              formula = y ~ poly(x, 3), color = "red")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2279 rows containing non-finite values (stat_smooth).
## Removed 2279 rows containing non-finite values (stat_smooth).
## Removed 2279 rows containing non-finite values (stat_smooth).
## Warning: Removed 2279 rows containing missing values (geom_point).
Nice trend curves! Notice that even with the non-linear curves, we are missing the trend for certain heights. This signifies that it is difficult to specify a global trend for these variables.
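One way to let the data drive the shape of the trend (a sketch, assuming the mgcv package is installed) is a flexible GAM smoother via geom_smooth():
# A GAM smoother instead of a global polynomial (requires the mgcv package)
ggplot(NHANESraw, aes(Height, Weight, size = WTMEC4YR)) +
  geom_point(alpha = 0.1) +
  guides(size = "none") +
  geom_smooth(method = "gam", formula = y ~ s(x), se = FALSE,
              aes(weight = WTMEC4YR), color = "purple")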
Let’s return to the bubble plot of Height versus Weight, colored by Gender, for 20 year olds. It’s now time to add trend lines and compare the trend lines when we do and do NOT account for the survey weights.
# Add non-survey-weighted trend lines to the bubble plot
ggplot(NHANES20, aes(Height, Weight, size = WTMEC4YR, color = Gender)) +
  geom_point(alpha = 0.1) +
  guides(size = "none") +
  geom_smooth(method = "lm", se = FALSE, linetype = 2)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
# Add survey-weighted trend lines
ggplot(NHANES20, aes(Height, Weight, size = WTMEC4YR, color = Gender)) +
  geom_point(alpha = 0.1) +
  guides(size = "none") +
  geom_smooth(method = "lm", se = FALSE, linetype = 2) +
  geom_smooth(method = "lm", se = FALSE, aes(weight = WTMEC4YR))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Removed 4 rows containing missing values (geom_point).
How do the survey weights impact the lines?
Now it’s time for us to build a simple linear regression model for the Weight of 20-year-olds using Height. To do so, we first need to subset the design object NHANES_design to only include 20-year-olds while still retaining the original design structure.
# Subset the survey design object to only include 20 year olds
NHANES20_design <- NHANES_design %>%
  subset(Age == 20)
# Build a linear regression model
mod <- svyglm(Weight ~ Height, design = NHANES20_design)
# Print summary of the model
summary(mod)
##
## Call:
## svyglm(formula = Weight ~ Height, design = NHANES20_design)
##
## Survey design:
## subset(., Age == 20)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.2571 22.9836 -2.926 0.00674 **
## Height 0.8305 0.1368 6.072 1.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 326.6108)
##
## Number of Fisher Scoring iterations: 2
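With the fitted model in hand, predict() on the svyglm object returns an estimated mean weight at a given height; a sketch for a hypothetical 20 year old who is 170 cm tall:
# Estimated mean Weight at Height = 170 (a hypothetical value)
predict(mod, newdata = data.frame(Height = 170))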
Great modeling! Let’s look more closely at the summary information.
Print summary(mod) in the console and check out the coefficients table. Select the value of the test statistic if we are testing for a linear relationship between Height and Weight.
Correct! And what does its corresponding p-value imply? Here, with a test statistic of 6.072 and a p-value of essentially zero, we have evidence of a linear trend/relationship between Height and Weight.
Now let’s roll Gender into our regression model of Weight using Height. Look at the scatter plot we made in a previous exercise. Does it seem like we should vary the slopes or not? Let’s practice building both a same slope model and a different slopes model.
# Build the same slope linear regression model
mod1 <- svyglm(Weight ~ Height + Gender, design = NHANES20_design)
# Print summary of the same slope model
summary(mod1)
##
## Call:
## svyglm(formula = Weight ~ Height + Gender, design = NHANES20_design)
##
## Survey design:
## subset(., Age == 20)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -53.8665 22.7622 -2.366 0.0254 *
## Height 0.7434 0.1391 5.346 1.2e-05 ***
## Gendermale 2.7207 3.2471 0.838 0.4095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 325.3881)
##
## Number of Fisher Scoring iterations: 2
# Build the different slopes linear regression model
mod2 <- svyglm(Weight ~ Height * Gender, design = NHANES20_design)
# Print summary of the different slopes model
summary(mod2)
##
## Call:
## svyglm(formula = Weight ~ Height * Gender, design = NHANES20_design)
##
## Survey design:
## subset(., Age == 20)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.5061 21.5357 0.441 0.66257
## Height 0.3565 0.1269 2.809 0.00932 **
## Gendermale -131.0884 41.9989 -3.121 0.00438 **
## Height:Gendermale 0.7897 0.2385 3.311 0.00273 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 316.5007)
##
## Number of Fisher Scoring iterations: 2
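To help decide between the same slope and different slopes models, the survey package provides an anova() method for svyglm fits that performs a design-adjusted (Rao-Scott) test of the added interaction; a sketch:
# Design-adjusted comparison of the nested models
anova(mod1, mod2)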
Let’s take a new modeling example from start to finish! Let’s build a model to predict BPSysAve, a person’s systolic blood pressure reading, using BPDiaAve, a person’s diastolic blood pressure reading, and Diabetes, whether or not they were diagnosed with diabetes.
NHANESraw %>%
  drop_na(Diabetes) %>%
  ggplot(aes(BPDiaAve, BPSysAve, size = WTMEC4YR, color = Diabetes)) +
  geom_point(alpha = 0.2) +
  guides(size = "none") +
  geom_smooth(method = "lm", se = FALSE, aes(weight = WTMEC4YR))
## Warning: Removed 4600 rows containing missing values (geom_point).
# Build simple linear regression model
mod1 <- svyglm(BPSysAve ~ BPDiaAve, design = NHANES_design)
# Build model with different slopes
mod2 <- svyglm(BPSysAve ~ BPDiaAve * Diabetes, design = NHANES_design)
# Summarize models
summary(mod1)
##
## Call:
## svyglm(formula = BPSysAve ~ BPDiaAve, design = NHANES_design)
##
## Survey design:
## svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU,
## nest = TRUE, weights = ~WTMEC4YR)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.74311 1.86920 45.87 <2e-16 ***
## BPDiaAve 0.48150 0.02354 20.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 290.3472)
##
## Number of Fisher Scoring iterations: 2
summary(mod2)
##
## Call:
## svyglm(formula = BPSysAve ~ BPDiaAve * Diabetes, design = NHANES_design)
##
## Survey design:
## svydesign(data = NHANESraw, strata = ~SDMVSTRA, id = ~SDMVPSU,
## nest = TRUE, weights = ~WTMEC4YR)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.58652 2.05537 40.667 < 2e-16 ***
## BPDiaAve 0.49964 0.02623 19.047 < 2e-16 ***
## DiabetesYes 25.36616 3.56587 7.114 6.53e-08 ***
## BPDiaAve:DiabetesYes -0.22132 0.05120 -4.323 0.000156 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 279.1637)
##
## Number of Fisher Scoring iterations: 2
Good work! Now you are ready to analyze your own survey data!